These are the projects and labs completed for Data Visualization (STAT 302) - Spring 2020.
The goals of this lab are to (1) ensure that the major software for this course is properly installed and functional, (2) develop and follow a proper workflow, and (3) work together to construct a few plots to explore a dataset using ggplot2 — demonstration of the utility and power of ggplot2.
Don’t worry if you cannot do everything here by yourself. You are just getting started and the learning curve is steep, but remember that the instructional team and your classmates will be there to provide support. Persevere and put forth an honest effort and this course will payoff.
Load Packages tidyverse, ggstance, skimr
# Load package(s)
suppressPackageStartupMessages(library(tidyverse))
#suppressPackageStartupMessages(library(ggstance))
suppressPackageStartupMessages(library(skimr))We’ll be using data from the lego package which is already in the /data subdirectory, along with many other processed datasets, as part of the zipped folder for this lab.
Let’s look at some interesting patterns in the history of LEGO! We’ll be using data from the lego package located data/legosets.rda. We will work through this exercise together in class.
The lego package provides a helpful dataset some interesting variables. Let’s take a quick look at the data.
## Rows: 6,172
## Columns: 14
## $ Item_Number <chr> "10246", "10247", "10248", "10249", "10581", "10582", ...
## $ Name <chr> "Detective's Office", "Ferris Wheel", "Ferrari F40", "...
## $ Year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ...
## $ Theme <chr> "Advanced Models", "Advanced Models", "Advanced Models...
## $ Subtheme <chr> "Modular Buildings", "Fairground", "Vehicles", "Winter...
## $ Pieces <int> 2262, 2464, 1158, 898, 13, 39, 32, 105, 13, 11, 52, 13...
## $ Minifigures <int> 6, 10, NA, NA, 1, 2, 2, 3, 2, 2, 3, 1, NA, NA, NA, NA,...
## $ Image_URL <chr> "http://images.brickset.com/sets/images/10246-1.jpg", ...
## $ GBP_MSRP <dbl> 132.99, 149.99, 69.99, 59.99, 9.99, 16.99, 19.99, 49.9...
## $ USD_MSRP <dbl> 159.99, 199.99, 99.99, 79.99, 9.99, 19.99, 24.99, 59.9...
## $ CAD_MSRP <dbl> 199.99, 229.99, 119.99, NA, 12.99, 24.99, 29.99, 69.99...
## $ EUR_MSRP <dbl> 149.99, 179.99, 89.99, 69.99, 9.99, 19.99, 24.99, 59.9...
## $ Packaging <chr> "Box", "Box", "Box", "Box", "Box", "Box", "Box", "Box"...
## $ Availability <chr> "Retail - limited", "Retail - limited", "LEGO exclusiv...
| Name | legosets |
| Number of rows | 6172 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Item_Number | 0 | 1 | 1 | 13 | 0 | 5854 | 0 |
| Name | 0 | 1 | 2 | 73 | 0 | 5519 | 4 |
| Theme | 0 | 1 | 4 | 28 | 0 | 115 | 0 |
| Subtheme | 0 | 1 | 0 | 32 | 2206 | 358 | 1 |
| Image_URL | 0 | 1 | 46 | 58 | 0 | 6172 | 0 |
| Packaging | 0 | 1 | 3 | 21 | 0 | 14 | 0 |
| Availability | 0 | 1 | 6 | 21 | 0 | 8 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Year | 0 | 1.00 | 2004.71 | 8.91 | 1971.00 | 2000.00 | 2006.00 | 2012.00 | 2015.00 |
| Pieces | 112 | 0.98 | 215.17 | 356.20 | 0.00 | 30.00 | 82.00 | 256.25 | 5922.00 |
| Minifigures | 2672 | 0.57 | 2.85 | 2.72 | 1.00 | 1.00 | 2.00 | 4.00 | 32.00 |
| GBP_MSRP | 1980 | 0.68 | 23.45 | 31.93 | 0.00 | 5.99 | 12.99 | 29.99 | 509.99 |
| USD_MSRP | 355 | 0.94 | 27.90 | 39.32 | 0.00 | 6.00 | 14.99 | 34.99 | 789.99 |
| CAD_MSRP | 4190 | 0.32 | 46.34 | 58.46 | 2.99 | 12.99 | 24.99 | 54.99 | 789.99 |
| EUR_MSRP | 4399 | 0.29 | 35.98 | 46.61 | 0.00 | 9.99 | 19.99 | 39.99 | 699.99 |
Notice there are a lot of missing variables, especially when it comes to pricing - this will be important for when we calculate the means.
First, let’s look at the average cost of LEGO sets over time. The main variable of interest here is USD_MSRP, or the manufacturer’s suggested retail price in constant dollars (i.e. not adjusted for inflation).
#`drop` NA's at top
df.1b <- legosets %>%
drop_na(USD_MSRP) %>%
group_by(Year) %>%
summarize(
AveragePrice = mean(USD_MSRP)
)
#`remove` NA's in mean function
#equivalent to `drop` NAs at top for this task
#df.1b <- legosets %>%
# group_by(Year) %>%
# summarize(
# AverageCost = mean(USD_MSRP, na.rm=TRUE)
# )
ggplot(df.1b, aes(Year, AveragePrice))+
geom_point(size=1.25)+
geom_line(size=1)+
ggtitle("LEGO Sets: Average Price vs Year")+
theme_minimal()+
labs( #lables on the plot
x = "Year"
, y = "Average Price (USD)"
, subtitle = "sets without price are removed from calculation of average price"
, caption = "Source: LEGO"
)Is the increase in price simply due to inflation? Could get data from federal resersve and plot inflation-adjusted price vs year.
Next, let’s look at how the number of pieces per set has changed over time. Because Duplo sets are much smaller (since they’re designed for toddlers), we’ll make a special indicator variable for them.
df.1c <- legosets %>%
mutate(IsDuplo = ifelse(Theme == "Duplo", "Duplo", "Lego")) %>%
drop_na(Pieces) %>%
group_by(Year, IsDuplo) %>%
summarize(
AveragePieces = mean(Pieces)
,num_sets = n()
)
#same plot
ggplot(df.1c, aes(Year, AveragePieces, color=IsDuplo))+
geom_point()+
geom_line()+
ggtitle("Lego Sets: Average Pieces vs Year")+
labs( #lables on the plot
x = "Year"
, y = "Average Pieces"
, subtitle = "sets without pieces are removed from calculation of average pieces"
, caption = "Source: LEGO"
)#facet plots
ggplot(df.1c, aes(Year, AveragePieces, color=IsDuplo))+
facet_wrap(~IsDuplo, ncol=2, scales="fixed")+
geom_point()+
geom_line()+
ggtitle("Lego Sets: Average Pieces vs Year")+
labs( #lables on the plot
x = "Year"
, y = "Average Pieces"
, subtitle = "sets without pieces are removed from calculation of average pieces"
, caption = "Source: LEGO"
)+
theme(
legend.position = "none"
)
df.1c <- legosets %>%
mutate(IsDuplo = ifelse(Theme == "Duplo", "Duplo", "Lego")) %>%
drop_na(Pieces) %>%
filter(Year > 1985) %>%
group_by(Year, IsDuplo) %>%
summarize(
AveragePieces = mean(Pieces)
,num_sets = n()
)
#same plot
ggplot(df.1c, aes(Year, AveragePieces, color=IsDuplo))+
geom_point()+
geom_line()+
ggtitle("Lego Sets: Average Pieces vs Year")+
labs( #lables on the plot
x = "Year"
, y = "Average Pieces"
, subtitle = "sets without pieces are removed from calculation of average pieces"
, caption = "Source: LEGO"
)#facet plots
ggplot(df.1c, aes(Year, AveragePieces, color=IsDuplo))+
facet_wrap(~IsDuplo, ncol=2, scales="fixed")+
geom_point()+
geom_line()+
ggtitle("Lego Sets: Average Pieces vs Year")+
labs( #lables on the plot
x = "Year"
, y = "Average Pieces"
, subtitle = "sets without pieces are removed from calculation of average pieces"
, caption = "Source: LEGO"
)+
theme(
legend.position = "none"
)In the 1990s, LEGO began partnering with famous brands and franchises to boost its own sales. First, let’s see how many different “themes” LEGO now offers:
## [1] 115
## # A tibble: 1 x 1
## n_themes
## <int>
## 1 115
#counts of the themes
df.1d <- legosets %>%
#count data, automatically labeled 'n'
count(Theme, sort = TRUE
, name ="theme_count" #personalize name if wnat
) %>%
#turning character chr into factor fct -- meaning a categorical variable
mutate(Theme = fct_inorder(Theme, ordered = TRUE))
df.1d %>%
filter(theme_count > 150, Theme != 'Duplo') %>% #only look at top ones
ggplot(aes(x=fct_rev(Theme), y= theme_count)) +
geom_col()+
coord_flip()+
labs(y="Number of Sets", x=NULL)The goal of this lab is to begin the process of unlocking the power of ggplot2 through constructing and experimenting with a few basic plots.
We’ll be using data from the blue_jays.rda dataset which is already in the /data subdirectory in our data_vis_labs project. Below is a description of the variables contained in the dataset.
BirdID - ID tag for birdKnownSex - Sex coded as F or MBillDepth - Thickness of the bill measured at the nostril (in mm)BillWidth - Width of the bill (in mm)BillLength - Length of the bill (in mm)Head - Distance from tip of bill to back of head (in mm)Mass - Body mass (in grams)Skull - Distance from base of bill to back of skull (in mm)Sex - Sex coded as 0 = female or 1 = maleWe’ll also be using a subset of the BRFSS (Behavioral Risk Factor Surveillance System) survey collected annually by the Centers for Disease Control and Prevention (CDC). The data can be found in the provided cdc.txt file — place this file in your /data subdirectory. The dataset contains 20,000 complete observations/records of 9 variables/fields, described below.
genhlth - How would you rate your general health? (excellent, very good, good, fair, poor)exerany - Have you exercised in the past month? (1 = yes, 0 = no)hlthplan - Do you have some form of health coverage? (1 = yes, 0 = no)smoke100 - Have you smoked at least 100 cigarettes in your life time? (1 = yes, 0 = no)height - height in inchesweight - weight in poundswtdesire - weight desired in poundsage - in yearsgender - m for males and f for femalesNotice we are setting a seed. This signifies we will be doing something that relies on a random process (e.g., random sampling). In order for our results to be reproducible we set the seed. This ensures that every time you run the code or someone else does, it will produce the exact same output. It is good coding etiquette to set the seed towards the top of your document/code.
Complete the following exercises.
Using blue_jay dataset, construct the following scatterplots of Head by Mass:
color aesthetic set to Northwestern purple (#4E2A84), shape aesthetic set a solid/filled triangle, and size aesthetic set to 2.Sex or KnownSex mapped to the color aesthetic. That is, determine which is more appropriate and explain why. Also set the size aesthetic to 2.Consider the color aesthetic in the plots for (1) and (2). Explain why these two usages of the color aesthetic are meaningfully different.
## BirdID KnownSex BillDepth BillWidth BillLength Head Mass Skull Sex
## 1 0000-00000 M 8.26 9.21 25.92 56.58 73.30 30.66 1
## 2 1142-05901 M 8.54 8.76 24.99 56.36 75.10 31.38 1
## 3 1142-05905 M 8.39 8.78 26.07 57.32 70.25 31.25 1
## 4 1142-05907 F 7.78 9.30 23.48 53.77 65.50 30.29 0
## 5 1142-05909 M 8.71 9.84 25.47 57.32 74.90 31.85 1
## 6 1142-05911 F 7.28 9.30 22.25 52.25 63.90 30.00 0
ggplot(blue_jays, aes(Mass, Head))+
geom_point(color="#4E2A84", shape=17, size=2)+
ggtitle("Blue Jays: Head vs Mass")ggplot(blue_jays, aes(Mass, Head, color=KnownSex))+
geom_point(size=2)+
ggtitle("Blue Jays: Head vs Mass")The first plot uses colors for a purley subjective reason, a preferance for purple, but does not have any aesthetic effect on the data. The second plot uses the aesthetic color to section the data into two groups (Female and Male); this plot is using color as part of the mapping of the data to aesthetics.
Using a random subsample of size 100 from the cdc dataset (code provided below), construct a scatterplot of weight by height. Construct 5 more scatterplots of weight by height that make use of aesthetic attributes color and shape (maybe size too). You can define both aesthetics at the same time in each plot or one at a time. Just experiment. — Should be six total plots.
# Read in the cdc dataset
cdc <- read_delim(file = "data/cdc.txt", delim = "|") %>%
mutate(genhlth = factor(genhlth,
levels = c("excellent", "very good", "good", "fair", "poor")
))
# Selecting a random subset of size 100
cdc_small <- cdc %>% sample_n(100)#plot 1 plain
ggplot(cdc_small, aes(height, weight))+
geom_point(size=2, alpha=0.5)+
ggtitle("Weight by Height: Plain")#plot 2 gender
ggplot(cdc_small, aes(height, weight, shape=gender, color=gender))+
geom_point(size=2, alpha=.5)+
ggtitle("Weight by Height: Gender")#plot 3 general health
ggplot(cdc_small, aes(height, weight, color=genhlth, shape=genhlth))+
geom_point(size=2)+
ggtitle("Weight by Height: General Health")#plot 4 age
ggplot(cdc_small, aes(height, weight, color=age, size=age))+
geom_point()+
ggtitle("Weight by Height: Age")#plot 5 smoking
ggplot(cdc_small, aes(height, weight, color=as.factor(smoke100)))+
geom_point(size=2)+
ggtitle("Weight by Height: Smoking")+
scale_color_discrete(
name="Smoked at least \n100 cigarettes"
,label=c("No", "Yes")
)#plot 5 healthcare coverage
ggplot(cdc_small, aes(height, weight, color=as.factor(hlthplan)))+
geom_point(size=2)+
ggtitle("Weight by Height: Healthcare")+
scale_color_discrete(name="Health Care \nCoverage", label=c("No", "Yes"))The goal of this lab is to continue the process of unlocking the power of ggplot2 through constructing and experimenting with a few basic plots.
We’ll be using data from the BA_degrees.rda and dow_jones_industrial.rda datasets which are already in the /data subdirectory in our data_vis_labs project. Below is a description of the variables contained in each dataset.
BA_degrees.rda
field - field of studyyear_str - academic year (e.g. 1970-71)year - closing year of academic yearcount - number of degrees conferred within a field for the yearperc - field’s percentage of degrees conferred for the yeardow_jones_industrial.rda
date - dateopen - Dow Jones Industrial Average at openhigh - Day’s high for the Dow Jones Industrial Averagelow - Day’s low for the Dow Jones Industrial Averageclose - Dow Jones Industrial Average at closevolume - number of trades for the dayWe’ll also be using a subset of the BRFSS (Behavioral Risk Factor Surveillance System) survey collected annually by the Centers for Disease Control and Prevention (CDC). The data can be found in the provided cdc.txt file — place this file in your /data subdirectory. The dataset contains 20,000 complete observations/records of 9 variables/fields, described below.
genhlth - How would you rate your general health? (excellent, very good, good, fair, poor)exerany - Have you exercised in the past month? (1 = yes, 0 = no)hlthplan - Do you have some form of health coverage? (1 = yes, 0 = no)smoke100 - Have you smoked at least 100 cigarettes in your life time? (1 = yes, 0 = no)height - height in inchesweight - weight in poundswtdesire - weight desired in poundsage - in yearsgender - m for males and f for femalesThe following exercises use the BA_degrees data set.
# Wrangling for plotting
ba_dat <- BA_degrees %>%
# mean % per field
group_by(field) %>%
mutate(mean_perc = mean(perc)) %>%
# Only fields with mean >= 5%
filter(mean_perc >= 0.05) %>%
# Organizing for plotting
arrange(desc(mean_perc), year) %>%
ungroup() %>%
mutate(field = fct_inorder(field))
#take a look at the data to see variable names and types
head(ba_dat)## # A tibble: 6 x 6
## field year_str year count perc mean_perc
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Business 1970-71 1971 115396 0.137 0.204
## 2 Business 1975-76 1976 143171 0.155 0.204
## 3 Business 1980-81 1981 200521 0.214 0.204
## 4 Business 1985-86 1986 236700 0.240 0.204
## 5 Business 1990-91 1991 249165 0.228 0.204
## 6 Business 1995-96 1996 226623 0.195 0.204
#percent of degrees conferred by year
ggplot(ba_dat, aes(year, perc)) +
geom_line(size=1) +
#make chart for each field
facet_wrap(~field) +
xlab("Year") +
ylab("Proportion of degrees")#percent of degrees conferred by year
ggplot(ba_dat, aes(year, perc)) +
#colored area may make comparison between field easier
geom_area(alpha=0.5, fill="red") +
#line on areaa (may want a different color )
geom_line(color="red", size=1) +
#make chart for each field
facet_wrap(~field) +
xlab("Year") +
ylab("Proportion of degrees")The following exercises use the dow_jones-industrial data set.
# Restrict data to useful range
djia_date_range <- dow_jones_industrial %>%
filter(date >= ymd("2008/12/31") & date <= ymd("2010/01/10"))
#take a look at the data to see variable names and types
head(djia_date_range)## # A tibble: 6 x 6
## date open high low close volume
## <date> <dbl> <dbl> <dbl> <dbl> <int>
## 1 2008-12-31 8666. 8843. 8665. 8776. 226760000
## 2 2009-01-02 8772. 9065. 8761. 9035. 213700000
## 3 2009-01-05 9027. 9034. 8892. 8953. 233760000
## 4 2009-01-06 8955. 9088. 8941. 9015. 215410000
## 5 2009-01-07 8997. 8997. 8720. 8770. 266710000
## 6 2009-01-08 8770. 8770. 8651. 8742. 226620000
#closing price of DJIA by date
ggplot(djia_date_range, aes(date, close)) +
geom_line(size = 1, color="purple") +
geom_smooth(
method = "loess"
, formula = y ~ x
#line color
, color = "green"
#se shading color
, fill = "red"
) +
xlab(NULL) +
ylab("Dow Jones Industrial Average")#closing price of DJIA by date
ggplot(djia_date_range, aes(date, close)) +
geom_line(size=1) +
geom_smooth(
method="loess"
, formula= y ~ x
#wiggliness for loess curve
, span=0.3
#don't show se
, se=FALSE
) +
xlab(NULL) +
ylab("Dow Jones Industrial Average")ggplot(djia_date_range, aes(date, close)) +
geom_line(size=1) +
geom_smooth(
method="lm"
#splines makes the line more curvy and less 'chopped up'
#splines use piecewise polynomials
#lowess is 'local regression'
, formula= y ~ splines::ns(x, 6)
, span=0.3
, se = FALSE
) +
xlab(NULL) +
ylab("Dow Jones Industrial Average")The following exercises use the cdc dataset.
## # A tibble: 6 x 9
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
Using cdc and `geom_bar()
Using genhlth_count and geom_bar(stat="identity")
#count genhlth and put into new data set
genhlth_count <- cdc %>%
count(genhlth)
#take a look at the data to see variable names and types
genhlth_count## # A tibble: 5 x 2
## genhlth n
## <fct> <int>
## 1 excellent 4657
## 2 very good 6972
## 3 good 5675
## 4 fair 2019
## 5 poor 677
ggplot(genhlth_count, aes(genhlth, n)) +
#use stat='identity' to plot actual values; rather than counting values
geom_bar(stat="identity") +
ylab("count") Using
genhlth_count and geom_col()
ggplot(genhlth_count, aes(genhlth, n)) +
#geom_col will authomatically plot values
# where as geom_bar requires stat="identity" to do so
geom_col() +
ylab("count")ggplot(cdc, aes(genhlth, fill=as.factor(hlthplan))) +
geom_bar(position="dodge") +
#change title on legend
labs(fill="Health Plan") ggplot(cdc
, aes(
#variable used to calculate density
weight
#color of area
, fill = genhlth
#color of line
, color = genhlth
)
) +
geom_density(alpha = 0.2) +
# create plots for each gender
facet_wrap(~gender)The weight variable should have a lower limit of 50 and an upper limit of 300.
Fix x-axis limits
ggplot(cdc, aes(weight, fill = gender, color = gender)) +
geom_density(alpha = 0.5) +
facet_wrap(~genhlth, scales = "fixed") +
xlim(50, 300)Filter cdc df to weight values [50, 300]
cdc %>%
#filter data to only include values within the limts
filter(weight >= 50, weight <= 300) %>%
#pip directly into ggplot
ggplot(aes(weight, fill = gender, color=gender)) +
geom_density(alpha=0.5) +
facet_wrap(~genhlth, scales = "fixed")The goal of this lab is to continue the process of unlocking the power of ggplot2 through constructing and experimenting with a few basic plots.
Load Packages: tidyverse, gridExtra, ggrepel
We’ll be using data from the blue_jays.rda, tech_stocks.rda, corruption.rda, and cdc.txt datasets which are already in the /data subdirectory in our data_vis_labs project.
# Load datasets
load(file = "data/blue_jays.rda")
load(file = "data/tech_stocks.rda")
load(file = "data/corruption.rda")
# Read in the cdc dataset
cdc <- read_delim(
file = "data/cdc.txt"
, delim = "|"
) %>%
mutate(genhlth = factor(genhlth,
levels = c("excellent", "very good", "good", "fair", "poor")
))The following plot uses the blue_jays.rda dataset.
#create caption that automatically grabs number of blue jays
caption <- paste("Head length versus body mass for", nrow(blue_jays), "blue jays")
#add string wrap
#will break the caption into multiple lines if longer than 40 characters
# '\n' is the line break code
caption_print <- paste(strwrap(caption, 40), collapse ="\n")
#create data set for top head size for each sex
topHead <- blue_jays %>%
#arrange largest to smallest
arrange(desc(Head)) %>%
# group by sex
group_by(KnownSex) %>%
#take the top 2 head sizes for each group
top_n(n = 2, wt = Head)
#'M' label will be put on the top male head size
#'F' label will be put on the 2nd top female head size
Labels <- topHead[c(1,4),]
#ANOTHER OPTIONS: lable dataframe
#search by BirdID
Labels_anotheroption <- blue_jays %>%
#select specific bird where you want the labels
filter(BirdID %in% c("1142-05914", "702-90567"))
#get range for x and y variables
xrng <- range(blue_jays$Mass)
yrng <- range(blue_jays$Head)
#head length by body mass
ggplot(blue_jays, aes(Mass, Head, color = KnownSex)) +
geom_point(alpha = 0.6, size = 2) +
annotate(
"text"
#put text in the top left corner of plot
, x = xrng[1], y = yrng[2]
#label is the caption already create
, label = caption
#left justify
, hjust = 0
# bottom justify
, vjust = 1
#size the font
, size = 4
) +
xlab("Body mass (g)") +
ylab("Head length (mm)") +
#remove all legends; to remove just one legend put show.legend = FALSE into geom
theme(legend.position = "none") +
# add labels
geom_text(
#use labels data set
data = Labels
, aes(label = KnownSex)
#nudget labels to the right
, nudge_x = 0.5
)The following plots use the tech_stocks dataset.
#create caption
caption <- paste("Stock price over time for four major tech companies")
#add string wrap
#will break the caption into multiple lines if longer than 40 characters
# '\n' is the line break code
caption_print <- paste(strwrap(caption, 40), collapse = "\n")
#create labels df with most recent stock values
Labels <- tech_stocks %>%
#order by date
arrange(desc(date)) %>%
# return distinct company
distinct(company
#keep all variables not just company
, .keep_all = TRUE
)
#get range for x and y variables
xrng <- range(tech_stocks$date)
yrng <- range(tech_stocks$price_indexed)
#stock price by date
ggplot(tech_stocks, aes(date, price_indexed)) +
#put color in geom_line so labels are not colored
geom_line(aes(color = company)) +
# remove all legends
theme(legend.position = "none") +
#remove x label
xlab(NULL) +
ylab("Stock price, indexed") +
annotate(
"text"
#put text in the top left corner of plot
, x = xrng[1], y = yrng[2]
#label is the caption already create
, label = caption
#left justify
, hjust = 0
# bottom justify
, vjust = 1
#use serif font
, family = "serif"
#size the font
, size = 4
) +
#add company labels to the most recent stock price
geom_text(data = Labels, aes(label = company))#stock price by date
ggplot(tech_stocks, aes(date, price_indexed)) +
#put color in geom_line so labels are not colored
geom_line(aes(color = company)) +
# remove all legends
theme(legend.position = "none") +
#remove x label
xlab(NULL) +
ylab("Stock price, indexed") +
annotate(
"text"
#put text in the top left corner of plot
, x = xrng[1], y = yrng[2]
#label is the caption already create
, label = caption
#left justify
, hjust = 0
# bottom justify
, vjust = 1
#use serif font
, family = "serif"
#size the font
, size = 4
) +
#add company labels to the most recent stock price
geom_text_repel(
data = Labels
, aes(label = company)
#padding around label
, box.padding = 0.6
, min.segment.length = 0
#right align text (so doesn't go off the plot)
, hjust = 1
#set seed for repel iterations
, seed = 9876
)The following plot uses the corruption.rda dataset.
#look at 2015 data
corruption_2015 <- corruption %>%
#remove countries that don't have cpi or hdi values
drop_na(cpi, hdi) %>%
#only look at 2015 data
filter(year == 2015)
#the contries we want to label
CountriesToLabel <- c(
"Niger"
, "Iraq"
, "China"
, "Ghana"
, "Argentina"
, "Chile"
, "Japan"
, "United States"
, "Singapore"
)
#create labels data frame
Labels <- corruption_2015 %>%
#only select specific countries already specified
filter( country %in% CountriesToLabel)
#hdi by cpi
ggplot(corruption_2015, aes(cpi, hdi)) +
#put color in geom_line so labels are not colored, color by geom
geom_point(aes(color = region), alpha = 0.6, size = 2) +
geom_smooth(
method = "lm"
# use log(x) when modeling
, formula = y~log(x)
# specify grey color for line
, color = "grey60"
# remove standard error bars
, se = FALSE
) +
xlab("Corruption Perceptions Index, 2015 (100 = least corrupt)") +
#add y label; '\n' creates line break
ylab("Human Development Index, 2015\n (1.0 = most developed)") +
#add title
ggtitle("Corruption and human development (2015)")+
#add labels to specific countries
geom_text_repel(
data = Labels
, aes(label = country)
#set seed for repel iterations
, seed = 9876
, box.padding = 0.6
, min.segment.length = 0
)The next plot uses the cdc dataset.
Using Bilbo Baggins’ responses below to the CDC BRSFF questions, add Bilbo’s data point as a transparent (0.5) solid red circle of size 4 to a scatterplot of weight by height with transparent (0.1) solid blue circles of size 2 as the plotting characters. In addition, label the point with his name in red. Left justify and rotate the label so it reads vertically from bottom to top — shift it up by 10 pounds too. Plot should use appropriately formatted axis labels. Remember that the default shape is a solid circle.
genhlth - How would you rate your general health? fairexerany - Have you exercised in the past month? 1=yeshlthplan - Do you have some form of health coverage? 0=nosmoke100 - Have you smoked at least 100 cigarettes in your life time? 1=yesheight - height in inches: 46weight - weight in pounds: 120wtdesire - weight desired in pounds: 120age - in years: 45gender - m for males and f for females: mHint: Create a new dataset (maybe call it bilbo or bilbo_baggins) using either data.frame() (base R - example in book) or tibble() (tidyverse - see help documentation for the function). Make sure to use variable names that exactly match cdc’s variable names
bilbo <- tibble(
genhlth = "fair",
exerancy = 1,
hlthplan = 0,
smoke100 = 1,
height = 46,
weight = 120,
wtdesire = 120,
age = 45,
gender = "m"
)
ggplot(cdc, aes(height, weight)) +
geom_point(alpha = 0.1, color = "blue", size = 2) +
ggtitle("Bilbo Baggins is on the small side") +
labs(
x = "Height (inches)"
, y = "Weight (pounds)"
, subtitle = "CDC Data: Weight by Height"
) +
#add Bilbo Baggins special red point onto graph
geom_point(data = bilbo, color = "red", size = 4, alpha = 0.5) +
#add sing label
geom_text(
data = bilbo
, aes(label = "Bilbo Baggins")
, color = "red"
#left justify label
, hjust = 0
#rotate label
, angle = 90
#nudge label up 10 pounds
, nudge_y = 10
)The goal of this lab is to explore more useful plots in ggplot2. Specifically we will be focusing on surface plots and geospatial plots (maps).
Challenges are not mandatory for students to complete. We highly recommend students attempt them though. We would expect graduate students to attempt the challenges.
# Load package(s)
library(tidyverse)
library(gridExtra)
#for geom_hex
library(hexbin)
library(maps)
#https://github.com/hrbrmstr/statebins
library(statebins)
#color pallet https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html
library(viridis)
#for geom_sf; will not work with discrete fill unless the package is loaded
library(sf)
#for challegne with census data
library(tidycensus)
library(mapview)
library(tigris)
library(scales) #percent()
knitr::opts_chunk$set(dpi = 300) We’ll be using data from the blue_jays.rda, cdc.txt, and the US_income.rda datasets which are already in the /data subdirectory in our data_vis_labs project.
# Load datasets
load(file = "data/blue_jays.rda")
load(file = "data/US_income.rda")
# Read in the cdc dataset
cdc <- read_delim(
file = "data/cdc.txt"
, delim = "|"
#add col types
, col_types = cols(
genhlth = col_character(),
exerany = col_double(),
hlthplan = col_double(),
smoke100 = col_double(),
height = col_double(),
weight = col_double(),
wtdesire = col_double(),
age = col_double(),
gender = col_character()
)
) %>%
mutate(genhlth = factor(genhlth,
levels = c("excellent", "very good", "good", "fair", "poor")
))Complete the following exercises.
The following plot uses the blue_jays.rda dataset.
#body mass by head
ggplot(blue_jays, aes(Mass, Head)) +
#scatter plot
geom_point(alpha = 1/3, size = 1.5) +
#2D density plot
geom_density_2d(
#set binwidth for density plot
binwidth = 0.004
, color = "black"
, size = 0.4
) +
#set limits on x-axis
xlim(57, 82) +
# add labels
labs(
x = "Body Mass (g)"
, y = "Head length (mm)"
) +
#add minimal theme; white background on plot
theme_minimal()The following plots uses the cdc dataset
#set base plot that can be used for plot1&plot2
cdc_plot <- ggplot(cdc, aes(height, weight)) +
#set axis titles
labs(
x = "Height (in)"
, y = "Weight (lbs)"
)+
#add minimal theme; white background on plot
theme_minimal()
#2d density plot with hexagons
# requires `hexbins` package
cdc_plot + geom_hex(bins=35)The following plot is made using the maps library.
#requires `maps` package
#the best state in the union
mn <- map_data("county", "minnesota") %>%
select(long, lat, group, id = subregion)
ggplot(mn, aes(x = long, y = lat)) +
#add title+
ggtitle("Minnesota") +
#draws the shapes
geom_polygon(aes(group = group), fill = "white" , color = "grey35") +
#helps size correctely
coord_quickmap() +
#shows just the plotted features
theme_void()The following plots use the US_income dataset.
# Setting income levels
US_income <- mutate(
US_income,
income_bins = cut(
ifelse(is.na(median_income), 25000, median_income),
breaks = c(0, 40000, 50000, 60000, 70000, 80000),
labels = c("< $40k", "$40k to $50k", "$50k to $60k", "$60k to $70k", "> $70k"),
right = FALSE
)
)ggplot(US_income) +
# 'simple features', need `sf` package
geom_sf(
aes(
#specify geometry shape to be used
geometry = geometry
#fill based on median income
, fill = income_bins
)
#set boundary color
, color = "grey80"
#set boundary line size
, size = 0.2
) +
#scale color fill to viridis color pallet
viridis::scale_fill_viridis(
#change to continous scale
discrete = TRUE
#change legend name
, name = "Median\nIncome"
) +
#helps size correctely
coord_sf() +
#shows just the plotted features
theme_void()The following plots use the tidycensus package and few others, as well as using these directions.
Try using a different geographical area and a different variable from the ACS.
#GET CENSUS KEY
#install census key; ONLY NEED TO DO THIS ONCE
census_api_key("your census data api", install = TRUE, overwrite = TRUE)
#get census data api here: https://api.census.gov/data/key_signup.html
#run above; it is now stored and can be accessed using
Sys.getenv("CENSUS_API_KEY")
# https://walkerke.github.io/tidycensus/articles/basic-usage.html
#look through variables
var <- load_variables(2017, "acs5", cache = TRUE)
#then use View(var) to filter # SIZE: out.width="100%", fig.height=2.5
# https://walkerke.github.io/tidycensus/articles/spatial-data.html
options(tigris_use_cache = TRUE)
ny <- get_acs(geography = "tract",
variables = "B19013_001", #household median income
state = "NY",
county = "New York",
geometry = TRUE,
cb = FALSE)
#remove polygons over the water
st_erase <- function(x, y){
st_difference(x, st_union(y))
}
ny_water <- area_water("NY", "New York", class = "sf")
NY <- st_erase(ny, ny_water)
mapview(NY
, zcol = "estimate"
, legend = TRUE
#change size of lines
, lwd = 0.2)# SIZE: out.width="100%", fig.height=2.5
options(tigris_use_cache = TRUE)
twin_cities <- c("Hennepin", "Ramsey")
vars <- c(
renters = "B07013_003" #total renter-occupied units
, total = "B07013_001" #TOTAL
)
#get values
mn_val <- get_acs(
geography = "tract"
, variables = vars
, state = "MN",
, county = twin_cities
)
#great df with new values
#can't get pivot_wider to work when geometry = TRUE
#put together values seperatly and then add to mn_geom
MN_val <- mn_val %>%
pivot_wider(
names_from = variable,
values_from = c(estimate, moe)
) %>%
mutate(rent_perc = round(estimate_renters / estimate_total, 2)) %>%
select(GEOID, rent_perc)
#get geometric files
MN_geom <- get_acs(
geography = "tract"
, variables = vars[1]
, state = "MN",
, county = twin_cities
, geometry = TRUE
)
MN <- inner_join(MN_geom, MN_val, by = "GEOID")
mapview(MN
, zcol = "rent_perc"
, legend = TRUE
#make lines thinner
, lwd = 0.2
, layer.name = "Est % of Renters"
)The goal of this lab is to explore more plots in ggplot2. Specifically we will be focusing on error bars for uncertainty and practice using multiple layers.
We’ll be using the cows.rda, cdc.txt, and tech_stocks.rda datasets which are already in the /data subdirectory in our data_vis_labs project. We will also be using the mpg dataset which comes packaged with ggplot2 — use ?ggplot2::mpg to access its codebook.
#Load packages
library(tidyverse)
# Set seed
set.seed(9876)
#chunk options
knitr::opts_chunk$set(dpi = 300) # load data sets
load(file = "data/cows.rda")
load(file = "data/tech_stocks.rda")
# Read in the cdc dataset
cdc <- read_delim(
file = "data/cdc.txt"
, delim = "|"
#add col types
, col_types = cols(
genhlth = col_character(),
exerany = col_double(),
hlthplan = col_double(),
smoke100 = col_double(),
height = col_double(),
weight = col_double(),
wtdesire = col_double(),
age = col_double(),
gender = col_character()
)
) %>%
mutate(genhlth = factor(genhlth,
levels = c("excellent", "very good", "good", "fair", "poor")
))The following plot uses the mpg dataset.
# Additional dataset for plot
class_dat <- mpg %>%
group_by(class) %>%
summarise(
n = n(),
hwy = mean(hwy),
label = str_c("n = ", n, sep = "")
)ggplot(mpg, aes(class, hwy)) +
#addjitter plot with all of the points
geom_jitter(width = 0.1) +
#add red points at mean hwy mpg for each class
geom_point(data = class_dat, size = 5, color = "red", alpha = 0.6) +
#add labels with number of vechicals in each class
geom_text(data = class_dat, aes(label = label, y = 10), vjust = "indward") +
#change theme
theme_minimal() +
#add axis titles
labs(
x = "Vehicle Class"
, y = "Highway miles per gallon"
)The following plot uses the cows dataset.
# Graphic dataset
cow_means <- cows %>%
filter(breed != "Canadian") %>%
group_by(breed) %>%
summarize(
mean = mean(butterfat),
se = sd(butterfat) / sqrt(n())
) %>%
mutate(breed = fct_reorder(factor(breed), desc(mean)))z_star <- qnorm(0.975)
ggplot(cow_means, aes(breed, mean)) +
#add color and width
geom_col(fill = "#56B4E9", width = 0.7) +
geom_errorbar(
#add error bars
aes(
ymin = mean - se*z_star
, ymax = mean + se*z_star
)
, width = 0.1
) +
#change theme
theme_minimal() +
#add axis titles
labs(
x = "Cattle breed"
, y = "Mean percent butterfat\ncontent in milk"
)The following plot uses the tech_stocks dataset.
# percentage increase data
perc_increase <- tech_stocks %>%
ungroup(ticker) %>%
arrange(desc(date)) %>%
distinct(company, .keep_all = TRUE) %>%
mutate(
perc = 100 * (price - index_price) / index_price,
label = str_c(round(perc), "%", sep = ""),
company = fct_reorder(factor(company), perc)
)ggplot(perc_increase, aes(perc, company)) +
geom_col(fill = "#56B4E9") +
geom_text(aes(label = label), size = 5, hjust = 1.1, color = "white") +
#change theme
theme_minimal() +
#remove axis titles
labs(
x = NULL
, y = NULL
)The following plot uses the cdc dataset.
# 95% CI for weight for genhlth, gender groups
cdc_weight_95ci <- cdc %>%
group_by(genhlth, gender) %>%
summarise(
mean_wt = mean(weight),
se = sd(weight) / sqrt(n()),
moe = qt(0.975, n() - 1) * se
)ggplot(cdc_weight_95ci, aes(mean_wt, gender, color = genhlth)) +
geom_point(position = position_dodge(width = 0.5)) +
#add HORIZONTAL error bars
geom_errorbarh(
aes( xmin = mean_wt - moe
, xmax = mean_wt + moe
#change height
, height = 0.1
)
#width refers to size of dodged position; not error bars
, position = position_dodge(width = 0.5)
) +
theme_minimal() +
labs(
x = "Weight (lbs)"
, y = "Gender"
) +
scale_color_discrete(name = "General health\n(self reported)")The goal of this lab is to explore ways to manage and manipulate scales, axes, and legends within ggplot2.
We’ll be using the tech_stocks.rda, cdc.txt, and a few toy datasets.
# Load package(s)
library(tidyverse)
library(scales)
# Load datasets
load(file = "data/tech_stocks.rda")
# Read in the cdc dataset
cdc <- read_delim(file = "data/cdc.txt", delim = "|") %>%
mutate(genhlth = factor(genhlth,
levels = c("excellent", "very good", "good", "fair", "poor")
))
# Set seed
set.seed(8221984)
# Selecting a random subset of size 100
cdc_small <- cdc %>% sample_n(100)
# Generating toy datasets for exercise 2
dat1 <- tibble(theta = c(0, 2 * pi))
dat2 <- tibble(
theta = seq(0, 2 * pi, length.out = 100),
obs = rnorm(100, sin(theta), 0.1),
larger_than = ifelse(abs(obs) < abs(sin(theta)), "1", "0")
)The next plot uses the tech_stocks dataset.
Hints:
tech_stocks %>%
ungroup() %>%
ggplot(
aes(
date
, price_indexed
#reorder based on final values in plot
, color = fct_reorder2(company, date, price_indexed)
)
) +
theme_minimal() +
geom_line() +
#change line in legends to 1.3
#use color as that is the what is used in aes() for lines
guides(color = guide_legend(override.aes = list(size = 1.3))) +
scale_y_continuous(
name = NULL
, breaks = seq(0, 500, 100)
, labels = scales::dollar
, position = "right"
) +
scale_color_discrete(
name = NULL
#if don't reorder factors; can manually set in limits
# , limits = c("Facebook", "Alphabet", "Microsoft", "Apple")
)+
theme(
legend.position = c(0.75, 0.85)
) +
scale_x_date(
name = NULL
#remove extra padding on plot
, expand = c(0, 0)
)+
ggtitle("Stock price, indexed")The next plot uses the toy datasets dat1 and dat2.
Hints:
#56B4E9, darkgreen, & reddat2 %>%
mutate(sin_theta = sin(theta)) %>%
ggplot(aes(theta, obs, color = larger_than)) +
geom_point(
size = 2
, alpha = 0.8
) +
stat_function(
fun = sin
, size = 1.3
, color = "#56B4E9"
) +
#instead of stat_funnction could use geom_line
#geom_line(aes(y = sin_theta), size = 1.3, color = "#56B4E9") +
scale_color_manual(values = c("darkgreen", "red")) +
xlab( quote(theta) ) +
ylab( quote(sin(theta)) ) +
theme_minimal() +
theme(legend.position = "none")Using cdc_smallconstruct a scatterplot of weight by height with the following requirements:
genhlth."Set1" pallete.x (4) for poor, an hollow rotated square with an x in it (9) for fair, and a solid square (15) for good.height values should be limited between 55 and 80.height axis should display every 5th number between 55 and 80 and be appropriately labeled (i.e. 55 in, 60 in, …, 80 in). No axis title is necessary.weight values should be limited between 100 and 300.weight axis should be on log base 10 scale, but still display weights in pounds starting at 100 and displaying every 25 pounds until 300. Must be appropriately labeled (i.e. 100 lbs, 125 lbs, …, 300 lbs). No axis title is necessary.CDC BRFSS: Weight by Height.#limits and breaks for HEIGHT
htmin <- 55
htmax <- 80
htdiv <- 5
htnumb <- (htmax-htmin)/htdiv
htbreaks <- seq(htmin, htmax, htdiv)
#limits and breaks for WEIGHT
wtmin <- 100
wtmax <- 300
wtdiv <- 25
wtnumb <- (wtmax-wtmin)/wtdiv
wtbreaks <-seq(wtmin, wtmax, wtdiv)
#Legend title should be "General Health?" with a newline starting after general
leg_title <- "General\nHealth?"
#with each word in category capitalized in the legend
cap_labels <- c("Excellent", "VeryGood", "Good", "Fair", "Poor")
#PLOT
ggplot(cdc_small, aes(height, weight)) +
geom_point(aes(color = genhlth, shape = genhlth), size = 3) +
scale_color_brewer(
name = leg_title
, labels = cap_labels
, palette = "Set1"
) +
scale_shape_manual(
name = leg_title
, labels = cap_labels
, values = c(17, 19, 15, 9, 4)
) +
scale_x_continuous(
name = NULL
, limits = c(htmin, htmax)
, breaks = htbreaks
, labels = scales::unit_format(unit = "in")
) +
scale_y_log10(
name = NULL
, limits = c(wtmin, wtmax)
, breaks = wtbreaks
, labels = scales::unit_format(unit = "lbs")
) +
theme_minimal() +
theme(
legend.position = c(1, 0)
, legend.justification = c(1, 0)
) +
ggtitle("CDC BRFSS: Weight by Height")The goal of this lab is to develop an understanding of facets, position, continue the exploration of other ggplot2 options/features.
We’ll be using the titanic.rda, Aus_athletes.rda, and cdc.txt datasets.
# Load package(s)
library(tidyverse)
library(scales)
library(cowplot)
# Load datasets
load(file = "data/titanic.rda")
load(file = "data/Aus_athletes.rda")
# Read in the cdc dataset
cdc <- read_delim(file = "data/cdc.txt", delim = "|") %>%
mutate(genhlth = factor(genhlth,
levels = c("excellent", "very good", "good", "fair", "poor"),
labels = c("Excellent", "Very Good", "Good", "Fair", "Poor")
))
# Set seed
set.seed(8221984)
# Selecting a random subset of size 1000
cdc_small <- cdc %>% sample_n(1000)The following plot uses the titanic.rda dataset .
Hints:
#D55E00D0, #0072B2D0ggplot(titanic, aes(sex, fill = sex)) +
geom_bar() +
facet_grid(
factor(survived, labels = c("died", "survived"))
~
class
) +
scale_fill_manual(values = c("#D55E00D0", "#0072B2D0")) +
theme_minimal() +
theme(legend.position = "none")Use the athletes_dat dataset — extracted from Aus_althetes.rd — to recreate the following graphic as precisely as possible. The cowplot package will be useful.
# Get list of sports played by BOTH sexes
both_sports <- Aus_athletes %>%
distinct(sex, sport) %>%
count(sport) %>%
filter(n == 2) %>%
pull(sport)
# Process data
athletes_dat <- Aus_athletes %>%
filter(sport %in% both_sports) %>%
mutate(sport = case_when(
sport == "track (400m)" ~ "track",
sport == "track (sprint)" ~ "track",
TRUE ~ sport
)
)Hints:
cowplot::plot_grid() to combine them#D55E0040 and #0072B240 (bottom plot), #D55E00D0 & #0072B2D0 (for top two plots) — no alpha#D55E00 and #0072B2rcc: red blood cell count; wcc: white blood cell countplotA <- ggplot(athletes_dat, aes(sex, fill = sex)) +
geom_bar(show.legend = FALSE)+
scale_fill_manual(values = c("#D55E00D0", "#0072B2D0")) +
scale_x_discrete(
name = NULL
, labels = c("female", "male")
) +
scale_y_continuous(
name = "number"
, breaks = seq(0, 100, 25)
, limits = c(0, 95)
) +
theme_minimal()plotB <- ggplot(athletes_dat, aes(rcc, wcc)) +
geom_point(
aes(fill = sex)
, pch = 21
, color = "white"
, size = 3
, show.legend = FALSE
) +
scale_fill_manual(values = c("#D55E00D0", "#0072B2D0")) +
scale_x_continuous(name = "RBC Count") +
scale_y_continuous(name = "WBC count") +
theme_minimal()sex_labs <- c("female", "male")
plotC <- ggplot(athletes_dat, aes(sport, pcBfat)) +
geom_boxplot(
aes(color = sex, fill = sex)
, width = 0.5
) +
scale_fill_manual(
name = NULL
, labels = sex_labs
, values = c("#D55E0040", "#0072B240")) +
scale_color_manual(
name = NULL
, labels = sex_labs
, values = c("#D55E00D0", "#0072B2D0")) +
guides(fill = guide_legend(
ncol = 2
, override.aes = list(
fill = c("#D55E00D0", "#0072B2D0")
, color = "transparent"
)
)
) +
xlab(NULL) +
ylab("% body fat") +
theme_minimal() +
theme(
legend.position = c(1, 1)
, legend.justification = c(1, 1)
, legend.margin = margin(t=0)
)Using cdc_smallconstruct a the following graphic as precisely as possible.
Hints:
"#D55E00D0" and #0072B2D0grey80 might be usefulcdc_small_adj <- cdc_small %>%
mutate(wtloss = wtdesire - weight)
ggplot(cdc_small_adj, aes(weight, wtloss)) +
geom_point(
data = select(cdc_small_adj, -gender)
, color = "grey80"
, size = 2
) +
geom_point(
aes(color = gender)
, size = 2
, show.legend = FALSE
) +
scale_color_manual(values = c("#D55E00D0", "#0072B2D0")) +
facet_grid(
factor(gender, labels = c("Women", "Men"))
~
fct_rev(genhlth)
) +
xlab("Weight (lbs)") +
ylab( "Weight Lost/Gain in Pounds") +
theme_minimal()